TITLE OF THE INVENTION

METHOD, COMPUTER PROGRAM PRODUCT WITH PROGRAM CODE MEANS AND
COMPUTER PROGRAM PRODUCT FOR ANALYSIS OF A REGULATORY GENETIC
NETWORK OF A CELL

CROSS REFERENCE TO RELATED APPLICATIONS 

[0001] This application is based on and hereby claims priority to German Patent Application
No. 10330280.8 filed on July 4, 2003, the contents of which are hereby incorporated by 
reference. 

BACKGROUND OF THE INVENTION 

1. Field of the Invention 

[0002] The invention relates to an analysis of a regulatory genetic network of a cell using a 
statistical method. 

2. Description of the Related Art 

[0003] Fundamentals of a regulatory genetic network of a cell are known from Stetter et
al., Large-Scale Computational Modeling of Genetic Regulatory Networks, Kluwer Academic
Publisher, Netherlands, 2003. In this document such a regulatory genetic network is taken
to mean in particular regulatory interactions between genes of a cell.

[0004] A genome, i.e. the human genetic substance, is estimated to comprise 20,000 to
40,000 genes, of which a biologically specified number in each case, depending on a
specialization of a cell, are present in the cell in the form of a DNA or a part of a DNA.

[0005] A not necessarily contiguous section of this DNA containing the genetic code for a 
protein or also for a group of proteins or for creating a protein or a group of proteins is 
designated as a gene here. Overall the genes contain a genetic code for around a million 
proteins. 



[0006] An interplay or the interactions between the genes as well as with the proteins 
represents the most important part of a machinery (regulatory genetic network) which underlies 
the development of a human body from a fertilized egg cell as well as all bodily functions. 

[0007] It is also known from Stetter that so-called gene expression rates, which form a
gene expression pattern, supply a description or representation of a regulatory genetic network
or of a current status of the regulatory genetic network.

[0008] In simple terms or expressed more clearly the gene expression pattern of a cell thus 
represents a state of the regulatory genetic network of this cell. 

[0009] It is further known that by using high-throughput gene expression measurements 
(microarray data) these gene expression rates can be measured. The microarray data in its turn 
describes snapshots of the gene expression pattern. 

[0010] Many illnesses and malfunctions of the body are attributable to disturbances in the
regulatory genetic network, which are reflected by greatly changed gene expression behavior
(gene expression rates) or a changed gene expression pattern of a cell.

[0011] An understanding of the regulatory genetic network thus represents an important step
on the path to understanding genetic mechanisms and, consequently, to identifying what are
known as dominant or malfunction-initiating genes underlying the illnesses or malfunctions.

[0012] In cancer research, for example, suppressing genes can play a key role in the
identification of growths and tumors. Knowledge of new potential oncogenes and their
interactions with other genes can contribute to discovering the basic principles (of
cancers) which determine how normal cells change into malignant cancer cells.

[0013] Furthermore a quantitative understanding of the regulatory genetic network of a cell is 
necessary for developing improved medicaments and therapies for fighting genetic diseases. 

[0014] Thus a number of medicaments act as agonists or antagonists of specific target 
proteins, i.e. they strengthen or weaken the function of a protein with corresponding effect on 
the regulatory genetic network with the aim of bringing this back into a normal function mode. 






[0015] A description of a regulatory genetic network of a cell using a statistical method, a
causal network, is known from DE 10159262.0.

[0016] A causal network, a Bayesian network, is known from Jensen, An Introduction to
Bayesian Networks, UCL Press, London, 1996.

Bayesian networks 

[0017] A Bayesian network B is a specific type of representation of a joint multivariate
probability density function (WDF) of a set of variables X by a graphical model which consists of
two parts.

[0018] The first component is a directed acyclic graph (DAG) G, in which
each node i = 1, ..., n corresponds to a random variable X_i.

[0019] The connectors between the nodes represent statistical dependencies and can be
interpreted as causal relationships between them. The second component of the Bayesian
network is the set of conditional WDFs P(X_i | Pa_i, θ, G), which are parameterized by a
vector θ.

[0020] These conditional WDFs specify the type of dependency of each individual variable X_i
on the set of its parents Pa_i. Thus the joint WDF can be broken down into the product form

$$P(X_1, \ldots, X_n \mid \theta, G) = \prod_{i=1}^{n} P(X_i \mid Pa_i, \theta, G) \qquad (1)$$




[0021] The DAG of a Bayesian network uniquely describes the conditional dependency and 
independency relationships between a set of variables, but by contrast a given statistical 
structure of the WDF does not result in any unique DAG. 

[0022] Instead it can be shown that two DAGs describe one and the same WDF if and only if
they feature the same set of connectors (ignoring their direction) and the same set of
"colliders", with a collider being a constellation in which at least two directed connectors
lead to the same node.
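
As an illustration of the product form (1), the following Python sketch evaluates the joint probability of a small network as the product of its conditional probabilities. The three-gene network, the binary states and all probability values are hypothetical and chosen only for demonstration; they are not taken from the embodiment.

```python
from itertools import product

# Hypothetical network A -> C <- B with binary states (0/1).
# parents[node] lists the parent nodes Pa_i of each node.
parents = {"A": (), "B": (), "C": ("A", "B")}

# cpt[node][parent_values] -> [P(node=0 | parents), P(node=1 | parents)]
# All numbers are made up for illustration.
cpt = {
    "A": {(): [0.7, 0.3]},
    "B": {(): [0.6, 0.4]},
    "C": {(0, 0): [0.9, 0.1], (0, 1): [0.5, 0.5],
          (1, 0): [0.4, 0.6], (1, 1): [0.1, 0.9]},
}

def joint_probability(assignment):
    """Product over all nodes of P(X_i | Pa_i), i.e. the product form (1)."""
    p = 1.0
    for node, pa in parents.items():
        pa_values = tuple(assignment[q] for q in pa)
        p *= cpt[node][pa_values][assignment[node]]
    return p

# Sum of the joint probability over all 2**3 joint states.
total = sum(joint_probability(dict(zip(parents, values)))
            for values in product([0, 1], repeat=len(parents)))
```

Summing `joint_probability` over all eight joint states yields 1, which is a quick check that the conditional tables define a valid joint distribution.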






SUMMARY OF THE INVENTION 

[0023] An object of the invention is to specify a method which allows an analysis of a
regulatory genetic network of a cell, for example represented by at least one gene expression 
pattern of the cell. 

[0024] A further object of the invention is to specify a method which enables a defective gene 
to be identified, for example a cancer or tumor gene, in the regulatory genetic network of a cell. 

[0025] Further the invention is designed to allow a simulation and/or an analysis of an effect
of a medicament on the regulatory genetic network of a cell.

[0026] In the basic method for analysis of a regulatory genetic network of a cell a causal 
network is used, 

[0027] - the causal network describing the regulatory genetic network of the cell such
that nodes of the causal network represent genes of the regulatory genetic network and
connectors of the causal network represent regulatory interactions between the genes of the
regulatory genetic network.

[0028] In the analysis method a gene expression rate is now specified for a selected gene of 
the regulatory genetic network. Using the causal network a resulting gene expression pattern is 
generated for the predetermined gene expression rate for the regulatory genetic network. The 
resulting gene expression pattern generated is subsequently compared with a predetermined 
gene expression pattern of the regulatory genetic network.

[0029] A probabilistic semantic of a causal network, such as of a Bayesian network, is very 
well suited to analysis of gene expression rates, given for example in the form of microarray 
data, since it is adapted to the stochastic nature both of biological processes and also to 
experiments susceptible to noise. 

[0030] Furthermore, viewed in illustrative terms, an effect of an expression state of specific 
genes on a global gene expression pattern (inverse modeling) is estimated, in that a resulting 
gene expression pattern is analyzed.

[0031] The developments described below relate to both the method and to the configuration. 

[0032] The invention and the developments described below can be implemented both in 
software and also in hardware, for example by using a specific electrical circuit.

[0033] With a further development the selected gene is selected using the causal network by 
means of a dependency analysis. 

[0034] The gene expression rate of the selected gene can also be predetermined such that 
the predetermined gene expression rate of the selected gene reflects an assumption of a gene 
defect. 

[0035] A Bayesian network can be used as the causal network. 

[0036] The causal network can also be of a type DAG (Directed Acyclic Graph).

[0037] Furthermore the generated resulting and/or the predetermined gene expression
pattern can represent discrete gene states, with the represented discrete gene states being able
to be an overexpressed, a normal or an underexpressed gene state.

[0038] In a further development the generated resulting gene expression pattern can be
compared with the predetermined gene expression pattern using a statistical method and/or a
statistical measure, especially a measure of distance.

[0039] There can also be provision for the causal network to be trained using gene 
expression patterns, with the nodes and the connectors of the causal network being adapted. 

[0040] Furthermore it is expedient for the gene expression patterns, especially the 
predetermined gene expression pattern and/or the gene expression patterns for training, to be 
determined using a DNA microarray technique. 

[0041] In one embodiment the predetermined gene expression pattern and/or the gene 
expression pattern for training is a gene expression pattern of a genetic regulatory network of a 
diseased cell. 

[0042] Here for example the diseased cell can be a cancer cell, especially an oncocell with
ALL (Acute Lymphoblastic Leukemia).

[0043] Furthermore the diseased cell can feature an oncogene, especially an ALL oncogene. 

[0044] Also for a plurality of selected genes of the regulatory genetic network one gene 
expression can be predetermined in each case, a plurality of resulting gene expression patterns 
generated and/or a plurality of comparisons undertaken. 

[0045] In a further development the generation of the plurality of resulting gene expression 
patterns is performed iteratively. 

[0046] Furthermore the inventive procedure or development is particularly suitable for 
identifying a dominant gene and/or a degenerated/mutated/diseased gene/oncogene/tumor- 
suppressor gene. 

[0047] It is also suitable for identifying a tumor cell, for example in connection with cancer
detection.

[0048] Further the inventive method is especially suited to analyzing the causes of an
abnormal gene expression pattern or gene expression rate.

[0049] It can also be used for a simulation and/or analysis of the effects of a medicament. 

BRIEF DESCRIPTION OF THE DRAWINGS

[0050] These and other objects and advantages of the present invention
will become more apparent and more readily appreciated from the following description of an
exemplary embodiment of the invention, taken in conjunction with the accompanying drawings of which:

Figure 1 is a flowchart of a procedure for investigating genetically-related causes
of illness through Bayesian inverse modelling using a cancer as an example;

Figure 2 is a procedural listing for an algorithm for creating a data set of N
samples in accordance with an exemplary embodiment;

Figure 3 is a procedural listing for a procedure for creating data sets, which
reflect an effect of different observations in accordance with an exemplary embodiment;

Figures 4a and 4b are graphs which show that data obtained by sampling show
subtype characteristic expression patterns as also in an original data set;

Figure 5 is a graph which shows a probability of each subtype under
a condition that a gene is overexpressed, for all 271 genes;

Figure 6 is a graph structure of a causal network, which represents a
regulatory genetic network.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT 

[0051] Reference will now be made in detail to the preferred embodiments of the present
invention, examples of which are illustrated in the accompanying drawings, wherein like
reference numerals refer to like elements throughout.

Exemplary embodiment: Investigation of genetically-related causes of diseases using
Bayesian inverse modelling using a cancer as an example (see especially Fig. 1)

Overview of the Bayesian Inverse Modelling (BIM) procedure 

[0052] In many areas of empirical research the desire is to reach conclusions from the 
observation of trial results about the underlying principle and its causes - the relationship 
between "cause" and "effect". 

[0053] For example, cancer research studies the underlying principle which causes a
normal cell to transform into a malignant, rapidly growing cancer cell.

[0054] The effect of the various types of cancer is known, e.g. the general appearance of a 
cancer cell compared to a normal cell, measured with the aid of microarray chips. 

[0055] By contrast the cause of its origination is largely unknown. 

[0056] On the basis of the understanding that cancer is a genetic illness and that it is 
attributable to a deviation in the behavior of cells, the research is concentrating on discovering 
the genetic principles which are responsible for the development of the cancer. 

[0057] An important task in this environment is to identify genes which can play a role in 
tumor genesis, such as for example growth and tumor-suppressing genes. 

[0058] A procedure is described below with which it is possible to identify genes which are a 
potential cause of tumor genesis. 

[0059] One element of the procedure is a statistical method, in this case a Bayesian network
(see Jensen, above, and subsequent associated embodiments for more details), which is
learnt (see DE 10159262.0) from a microarray data set as described in Stetter (see
"Structural learning" below) (cf. Fig. 1).

[0060] In this case it is assumed that the set of the measured gene expression vectors X
belongs to a population with a high-dimensional multivariate probability density function which
is modelled with the aid of a Bayesian network with adaptive network structure.

[0061] The relationships between the variables, namely the conditional dependences and
independences, are represented by a Directed Acyclic Graph (DAG) G.

[0062] The probabilistic semantic of the Bayesian network is very well suited to the analysis
of microarray data since it is adapted to the stochastic nature both of the biological processes
and also of the experiments susceptible to noise.

[0063] In the procedure described below the learnt Bayesian network is used as a
generative model for sampling artificial microarray data sets which reflect the learnt
conditional probability density distributions (cf. Fig. 1, steps 110-130).

[0064] Furthermore the effect of the expression state of specific genes on the global gene
expression pattern (inverse modelling) is estimated, in that a resulting data set is analyzed (cf.
Fig. 1, steps 110-130).

[0065] In the procedure described below each gene is also assigned the probability with
which it is the cause of these cell states.

[0066] To this end these data sets are compared with data obtained from microarray
investigations of various known cell states (cf. Fig. 1, step 130).

[0067] Seen in general terms, the procedure does not concentrate explicitly on the structures 
of the network, but rather on the probability distribution which is derived from the learnt 
Bayesian network. 

[0068] Finally the procedure is applied to microarray data of different subtypes of pediatric
acute lymphoblastic leukemia (ALL) of Yeoh et al., "Classification, Subtype Discovery, and
Prediction of Outcome in Pediatric Acute Lymphoblastic Leukemia by Gene Expression Profiling",
Cancer Cell, 2002, pp. 133-143.

[0069] The comparison of the artificial data with expression patterns of specific cancer
subtypes enables a measure of probability of the illness-causing behavior of each gene (cf.
Fig. 1, step 130) to be obtained.

[0070] Results of the applied procedure show that Bayesian Inverse Modelling (BIM) allows
the effect of pathogenetically modified expression levels on the global gene expression
pattern to be predicted, in which case already known oncogenes as well as potential new
ones are found.

Bayesian networks

[0071] The basic principles of Bayesian networks, as described in Jensen, have already
been described above.

[0072] In the case of the modelling of a regulatory genetic network by a Bayesian network 
genes or their corresponding proteins are symbolized by nodes. 

[0073] Regulation mechanisms are described by connectors between two nodes, which can 
be interpreted in a causal manner. 

[0074] The quality of the regulation is encoded in the conditional probability distribution of the 
gene involved for given regulators of the same. 

Structural learning 

[0075] The process of structural learning can be described as follows: 

[0076] Let D = {d^1, d^2, ..., d^N} be a data set of N independent observations, with each data
point being an n-dimensional vector with components d^l = (d^l_1, d^l_2, ..., d^l_n). For a given D the
structure G of the Bayesian network is to be found which best corresponds to D, i.e. which
maximizes the Bayes score

$$P(G \mid D) = \frac{P(D \mid G)\, P(G)}{P(D)} \qquad (2)$$

[0077] with P(D | G) being the marginal likelihood, P(G) the a priori probability of the
structures and P(D) the evidence.

[0078] Since both the a priori probability and also the evidence are unknown, the problem is
reduced to determining the structures with the best marginal likelihood corresponding to the
data (Heckerman et al., "Learning Bayesian networks: The combination of knowledge and
statistical data", Machine Learning, vol. 20, 1995, pp. 197-243).

[0079] If the data set D consists of N microarray experiments, e.g. of cell samples of different
patients, each data vector d^l = (d^l_1, d^l_2, ..., d^l_n) represents the expression profile of n genes in a
microarray experiment.

[0080] A Bayesian network learnt from such data encodes the probability distribution of n 
genes, which were obtained from these N microarray experiments. 
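
The scoring of a candidate structure can be sketched as follows. The sketch assumes discrete data and uses the K2 metric, i.e. the marginal likelihood under uniform Dirichlet parameter priors, which is one special case of the scores discussed by Heckerman et al.; the text does not state which prior was actually used in the embodiment. The function name, the data layout and the toy data are illustrative assumptions.

```python
from collections import Counter
from math import lgamma

def log_marginal_likelihood(data, parents, n_states):
    """log P(D | G) under the K2 metric (uniform Dirichlet priors, an assumption).

    data     : list of dicts mapping variable name -> state index
    parents  : dict mapping variable name -> tuple of parent names (the DAG G)
    n_states : number of discrete states of every variable
    """
    r = n_states
    score = 0.0
    for node, pa in parents.items():
        # n_ijk: how often the node is in state k while its parents are in configuration j
        n_ijk = Counter((tuple(row[p] for p in pa), row[node]) for row in data)
        # n_ij: how often the parents are in configuration j
        n_ij = Counter()
        for (j, _k), count in n_ijk.items():
            n_ij[j] += count
        # log[(r - 1)! / (N_ij + r - 1)!] for every observed parent configuration
        for total in n_ij.values():
            score += lgamma(r) - lgamma(total + r)
        # log(N_ijk!) for every observed (configuration, state) pair
        for count in n_ijk.values():
            score += lgamma(count + 1)
    return score

# Toy data: two perfectly correlated three-state genes X and Y (hypothetical).
data = [{"X": 0, "Y": 0}] * 4 + [{"X": 2, "Y": 2}] * 4
score_empty = log_marginal_likelihood(data, {"X": (), "Y": ()}, n_states=3)
score_edge = log_marginal_likelihood(data, {"X": (), "Y": ("X",)}, n_states=3)
```

For this toy data the structure with the connector X -> Y receives the higher score, as expected for two perfectly correlated variables. A structure search would evaluate such a score for many candidate DAGs and keep the best one.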

Bayesian Inverse Modelling (BIM) 

Generative model 

[0081] A learnt (see notes above about "structural learning") Bayesian network B represents
a density estimation function which reflects the probability distribution of the data set D, on the
basis of which it was learnt, with the aid of the set of conditional WDFs.

[0082] This means that it can be used as a generative model for creating a data set D_B which
reflects the density distribution obtained from D.

[0083] Fig. 2 shows an algorithm 200 for creating a data set of N samples from B.

[0084] The first step 210 of the algorithm 200 consists of arranging all variables such that the
parents (parent nodes) Pa_i are instantiated before X_i.

[0085] Subsequently the variables are selected in the order of this arrangement and
instantiated with a value 220.

[0086] The value of each variable is selected with the probability P(state | Pa_i). This step is
repeated 230 until N samples are created.
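
The three steps of the algorithm 200 can be sketched in Python as follows. This is a minimal illustration, not the listing of Fig. 2: the function names, the data layout and the small example network with its probability values are assumptions made for demonstration.

```python
import random

def topological_order(parents):
    """Arrange variables so that every parent precedes its children (step 210)."""
    order, placed = [], set()
    while len(order) < len(parents):
        progressed = False
        for node, pa in parents.items():
            if node not in placed and all(p in placed for p in pa):
                order.append(node)
                placed.add(node)
                progressed = True
        if not progressed:
            raise ValueError("graph contains a cycle")
    return order

def sample_data_set(parents, cpt, n_samples, rng):
    """Create n_samples joint states by sampling each variable given its parents."""
    order = topological_order(parents)
    samples = []
    for _ in range(n_samples):          # step 230: repeat until N samples exist
        row = {}
        for node in order:              # step 220: instantiate in the chosen order
            probs = cpt[node][tuple(row[p] for p in parents[node])]
            row[node] = rng.choices(range(len(probs)), weights=probs)[0]
        samples.append(row)
    return samples

# Hypothetical network A -> C <- B; the child is listed first on purpose.
parents = {"C": ("A", "B"), "A": (), "B": ()}
cpt = {
    "A": {(): [0.7, 0.3]},
    "B": {(): [0.6, 0.4]},
    "C": {(0, 0): [0.9, 0.1], (0, 1): [0.5, 0.5],
          (1, 0): [0.4, 0.6], (1, 1): [0.0, 1.0]},
}
data = sample_data_set(parents, cpt, 300, random.Random(0))
```

Because the parents are always instantiated first, each conditional table can be looked up directly with the values already drawn for the current sample.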

Probabilistic inference

[0087] A significant problem in Bayesian networks is the evidence propagation, meaning the
determination of the a posteriori distribution P(X_q | E) of a query variable X_q, if a certain evidence
E has been observed in the Bayesian network.

[0088] As a result of the definition of a conditional probability, the a posteriori probability is

$$P(X_q \mid E) = \frac{P(X_q, E)}{P(E)} = \frac{\sum_{X \setminus \{X_q, X_E\}} P(X_1, \ldots, X_n)}{\sum_{X \setminus X_E} P(X_1, \ldots, X_n)} \qquad (3)$$

with X_E designating the set of the observed variables, which are fixed at the observed values E.

[0089] To overcome the time complexity, the different methods of exact inference
calculation use the general principle of dynamic programming.

[0090] As part of this exemplary embodiment a simple inference algorithm, "bucket
elimination", as described in Dechter, R., "Bucket Elimination: A unifying framework for
probabilistic inference", Uncertainty in Artificial Intelligence, UAI96, pp. 211-219, is
used.

[0091] The basic idea of this inference algorithm consists of eliminating variables one
after the other by summation in accordance with an order of elimination p.

[0092] In this way P(X_q | E) can be calculated efficiently within a reasonable time.
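
The elimination of variables by summation can be sketched as follows. This is a simplified variable elimination in the spirit of bucket elimination, not Dechter's algorithm in full; the factor representation, the function names and the example network with its probability values are assumptions made for illustration.

```python
from itertools import product

def multiply(f, g, card):
    """Pointwise product of two factors, each given as (variables, table)."""
    fv, ft = f
    gv, gt = g
    vs = tuple(dict.fromkeys(fv + gv))  # union of variables, order preserved
    table = {}
    for vals in product(*(range(card[v]) for v in vs)):
        a = dict(zip(vs, vals))
        table[vals] = ft[tuple(a[v] for v in fv)] * gt[tuple(a[v] for v in gv)]
    return vs, table

def sum_out(f, var):
    """Eliminate `var` from a factor by summation."""
    fv, ft = f
    keep = tuple(v for v in fv if v != var)
    table = {}
    for vals, p in ft.items():
        key = tuple(x for v, x in zip(fv, vals) if v != var)
        table[key] = table.get(key, 0.0) + p
    return keep, table

def restrict(f, evidence):
    """Set factor entries that contradict the observed evidence to zero."""
    fv, ft = f
    table = {}
    for vals, p in ft.items():
        consistent = all(evidence.get(v, x) == x for v, x in zip(fv, vals))
        table[vals] = p if consistent else 0.0
    return fv, table

def posterior(factors, evidence, order, card):
    """P(query | evidence); `order` lists every variable except the query."""
    factors = [restrict(f, evidence) for f in factors]
    for var in order:
        bucket = [f for f in factors if var in f[0]]
        if not bucket:
            continue
        rest = [f for f in factors if var not in f[0]]
        prod = bucket[0]
        for f in bucket[1:]:
            prod = multiply(prod, f, card)
        factors = rest + [sum_out(prod, var)]
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f, card)
    z = sum(result[1].values())
    return {vals[0]: p / z for vals, p in result[1].items()}

# Hypothetical network A -> C <- B with binary states.
card = {"A": 2, "B": 2, "C": 2}
p_c = {(0, 0): [0.9, 0.1], (0, 1): [0.5, 0.5], (1, 0): [0.4, 0.6], (1, 1): [0.1, 0.9]}
factors = [
    (("A",), {(0,): 0.7, (1,): 0.3}),
    (("B",), {(0,): 0.6, (1,): 0.4}),
    (("A", "B", "C"), {(a, b, c): p_c[(a, b)][c]
                       for a in range(2) for b in range(2) for c in range(2)}),
]
post_a = posterior(factors, {"C": 1}, order=["B", "C"], card=card)
```

Each elimination step multiplies only the factors that mention the variable being removed, which is what keeps the intermediate tables small compared with summing over the full joint distribution.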

Interventional modelling by setting the evidence 

[0093] With the interventional modelling approach the effect of specific observations on the
behavior of the Bayesian network is estimated using a combination of probabilistic inference
and data sampling.

[0094] In accordance with Fig. 3 the Bayesian network can be viewed as a kind of black box
300, with the input being given by a set of observations E 310 and the corresponding list of
observed variables X_E 320.

[0095] The output, which is given by the data set D_{B|E} 330, is created using the method
previously explained in association with Fig. 2.

[0096] In addition the empirical evidence is to be taken into account. 

[0097] Consequently each state of X_i is selected with probability P(state | Pa_i, E), which is
calculated by probabilistic inference.

[0098] With the procedure described in accordance with Fig. 3 different data sets can now be
created which reflect the effect of the different observations.
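
Sampling under a fixed observation can be sketched as follows. For brevity the sketch replaces the inference step by brute-force enumeration of the joint distribution, which is only feasible for very small networks, and it conditions each variable on all values already drawn in the current sample together with the evidence, so that the samples follow the distribution given E. The function names, the network and the probability values are hypothetical.

```python
import random
from itertools import product

def joint(assignment, parents, cpt):
    """Joint probability of a complete assignment (product of conditionals)."""
    p = 1.0
    for node, pa in parents.items():
        p *= cpt[node][tuple(assignment[q] for q in pa)][assignment[node]]
    return p

def mass(fixed, parents, cpt, card):
    """Total probability of all joint states that agree with `fixed`."""
    free = [v for v in parents if v not in fixed]
    total = 0.0
    for vals in product(*(range(card[v]) for v in free)):
        total += joint({**fixed, **dict(zip(free, vals))}, parents, cpt)
    return total

def sample_given_evidence(order, parents, cpt, card, evidence, n_samples, rng):
    """Draw samples in which the evidence variables are clamped to their values."""
    samples = []
    for _ in range(n_samples):
        fixed = dict(evidence)
        for node in order:
            if node in fixed:
                continue
            # Unnormalized P(node = s | values drawn so far, evidence)
            weights = [mass({**fixed, node: s}, parents, cpt, card)
                       for s in range(card[node])]
            fixed[node] = rng.choices(range(card[node]), weights=weights)[0]
        samples.append(fixed)
    return samples

# Hypothetical network A -> C <- B; gene C is fixed at state 1 as the evidence.
parents = {"A": (), "B": (), "C": ("A", "B")}
card = {"A": 2, "B": 2, "C": 2}
cpt = {
    "A": {(): [0.7, 0.3]},
    "B": {(): [0.6, 0.4]},
    "C": {(0, 0): [0.9, 0.1], (0, 1): [0.5, 0.5],
          (1, 0): [0.4, 0.6], (1, 1): [0.1, 0.9]},
}
samples = sample_given_evidence(["A", "B", "C"], parents, cpt, card,
                                {"C": 1}, 2000, random.Random(1))
```

In a realistic setting with hundreds of genes the enumeration in `mass` would have to be replaced by an efficient inference algorithm such as the one named in the text.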

[0099] If, as described below, biological effects are analyzed, this means that through this 
method of operation in accordance with Fig. 3 artificial microarray data can be created which 
reflects the probability distribution of a certain data set if specific observations are given. 

[0100] If the artificially created data from a known origin is compared for example with a 
cancer-specific set of measurement data, those genes can be determined which, when they are 
fixed at a certain expression level, will influence the model so that these two microarray data 
sets, the artificial and the known, exhibit the same characteristics. 

Statistical comparison of data sets 

[0101] In order to estimate the quality of the influence of the evidence E on the behavior of the
Bayesian network B, the created data set D_{B|E} is compared with a set of data sets D of known
states S.

[0102] It is assumed that D describes the effect of different types of cancer. In accordance 
with the embodiment the behavior of evidence E relating to a specific type of cancer S can now 
be described. 

[0103] By using a measure of distance the change α of the correlation between D_{B|E} and D_S
as a result of E can be estimated:

$$\alpha = \frac{d(D_{B|E}, D_S)}{d(D_B, D_S)} \qquad (4)$$

with the distance d between the two data sets having been standardized with the aid of the
distance between D_B, which was taken from B without evidence, and D_S.

[0104] As a result, in accordance with the embodiment, the influence of an observed
evidence, e.g. the expression state of a specific gene, on a behavior of the model
characteristic for cancer is measurable.

[0105] Secondly the probability can be calculated of B creating a data set D_{B|E} which is
equal to D_S for a given E.

[0106] For this purpose an estimate is made of how many samples d^l of D_{B|E} lie closest to D_S,
in that the distance between each sample and each data set of D is calculated.

[0107] The a posteriori probability P(S | E) of the occurrence of the cancer type S for given
evidence E is thus obtained:

$$P(S \mid E) = \frac{N_{ES}}{N} \qquad (5)$$

with N_ES being the number of samples of D_{B|E} which are statistically closest to the data set D_S, and
with N being the total number of samples of D_{B|E}.
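
The two comparisons described above can be sketched as follows. The text does not specify the measure of distance, so the sketch assumes the Euclidean distance between mean expression profiles, and it assumes that α is the distance between D_{B|E} and D_S divided by the distance between D_B and D_S, as suggested by the standardization described above. All function names and the toy data are hypothetical.

```python
import math

def mean_profile(data_set):
    """Component-wise mean of a list of expression vectors."""
    n = len(data_set)
    return [sum(column) / n for column in zip(*data_set)]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def data_set_distance(d1, d2):
    """Assumed distance between two data sets: distance of their mean profiles."""
    return euclidean(mean_profile(d1), mean_profile(d2))

def alpha(d_be, d_b, d_s):
    """Distance of D_B|E to D_S, standardized by the distance of D_B to D_S."""
    return data_set_distance(d_be, d_s) / data_set_distance(d_b, d_s)

def subtype_posterior(d_be, references):
    """Estimate P(S | E) = N_ES / N for every reference state S."""
    profiles = {s: mean_profile(d) for s, d in references.items()}
    counts = dict.fromkeys(references, 0)
    for sample in d_be:
        nearest = min(profiles, key=lambda s: euclidean(sample, profiles[s]))
        counts[nearest] += 1
    return {s: c / len(d_be) for s, c in counts.items()}

# Toy data with two genes and two reference states (made up for illustration).
references = {"S1": [[0, 0], [0, 1]], "S2": [[2, 2], [2, 1]]}
d_b = [[0, 0], [2, 2]]                   # sampled without evidence
d_be = [[0, 0], [2, 2], [2, 1], [1, 2]]  # sampled with evidence E
a = alpha(d_be, d_b, references["S2"])
p_s_given_e = subtype_posterior(d_be, references)
```

Under these assumptions a value of α below 1 means that fixing the evidence has moved the generated data closer to the reference state, and `p_s_given_e` gives the fraction of generated samples that lie nearest to each reference state.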

[0108] As already pointed out above, empirical research deals with the relationship between 
cause and effect, in that it draws conclusions about the underlying cause from experimental 
observation. 

[0109] With the Bayesian Inverse Modelling approach in accordance with the exemplary 
embodiment an underlying cause is estimated by first creating an effect which stems from a 
known observation. 

[0110] After this inverse step this effect is compared with effects which are well-defined but
for which the cause is unknown.

[0111] The potential cause of the best-match effect is then given by the observation which
gives rise to the created effect.

The ALL microarray data set of Yeoh et al.

[0112] The data which is used for the analysis in accordance with the exemplary embodiment
consists of 327 samples of various subtypes of pediatric acute lymphoblastic leukemia (ALL).

[0113] The data set was assembled by Yeoh and his colleagues at the St. Jude Children's
Research Hospital.



[0114] ALL is a heterogeneous illness which includes different subtypes, including both T-cell
type leukemia and B-cell type leukemia, which differ as regards their reaction to a medical
treatment.

[0115] Apart from T-ALL, of which the cause is not clearly known, each B-cell subtype can be 
traced back to a specific genetic modification, e.g. to genetic translocations t(9;22) [BCR-ABL], 
t(1;19) [E2A-PBX1], t(12;21) [TEL-AML1], t(4;11) [MLL] or to a hyperdiploid karyotype [> 50 
chromosomes]. 

[0116] It is therefore not surprising that the gene expression patterns of the different
subtypes differ very markedly from one another.

[0117] Furthermore the microarray data exhibits one more clear expression profile, which points
to the existence of a further ALL subtype in addition to the 6 known.

[0118] It should be pointed out that Yeoh et al. work on a robust classification of
the subtypes using a support vector machine with a set of 271 discriminating genes.

Results 

Learnt structure 

[0119] For analysis in accordance with the exemplary embodiment the reduced data set of
271 genes and 327 samples of different ALL subtypes, as described above with respect to
the work by Yeoh et al., is used.

[0120] To perform the learning process of a multivariate model the values of the data set have
been divided up into the discrete values "under-expressed", "expressed normally" and
"over-expressed".
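
A division into three discrete states can be sketched as follows. The text does not state how the boundaries between the states were chosen; the sketch assumes, purely for illustration, that a value counts as under- or over-expressed when it lies more than k standard deviations below or above the mean of the gene across all samples. The function name, the parameter k and the example values are hypothetical.

```python
from statistics import mean, pstdev

UNDER, NORMAL, OVER = 0, 1, 2

def discretize_gene(values, k=1.0):
    """Map the expression values of one gene to the states UNDER, NORMAL, OVER."""
    m, s = mean(values), pstdev(values)
    states = []
    for v in values:
        if v < m - k * s:
            states.append(UNDER)
        elif v > m + k * s:
            states.append(OVER)
        else:
            states.append(NORMAL)
    return states

# Expression values of one hypothetical gene in five samples.
states = discretize_gene([1.0, 5.0, 5.0, 5.0, 9.0])
```

Applying such a function to every gene turns the continuous microarray measurements into the three-state vectors from which the discrete Bayesian network is learnt.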

[0121] The learnt structure shows scale-free characteristics, a feature which is typical
of biological networks, such as metabolic networks or signaling networks.

[0122] Such networks are characterized by a power-law distribution of the degree of a node,
which is defined as the number of connections to other nodes.
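
The degree of each node and the resulting degree distribution can be obtained from the list of connectors, for example as in the following sketch. The gene names and connectors are hypothetical, and a network of this size is of course far too small to exhibit a power law.

```python
from collections import Counter

def degree_distribution(edges):
    """Return the degree of every node and how many nodes have each degree."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return degree, Counter(degree.values())

# Hypothetical connectors of a small learnt network.
edges = [("g0", "g1"), ("g0", "g2"), ("g0", "g3"), ("g3", "g4")]
degree, histogram = degree_distribution(edges)
```

For a scale-free network a plot of `histogram` on doubly logarithmic axes would be expected to follow approximately a straight line, with a few strongly connected nodes and many weakly connected ones.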

[0123] Strongly connected nodes have a strong influence on the dynamics and robustness of scale-free
networks, and of many of these strongly connected genes in our model it is actually known that
they play a role in oncogenesis or in the critical processes associated with the development
of cancer, e.g. DNA repair.

[0124] First a data set of 300 samples is now created from the model in order to estimate the 
statistics which are defined by the set of the conditional probabilities. 

[0125] Figures 4a and 4b show that data obtained by taking samples (Fig. 4b)
shows subtype characteristic expression patterns, as is also the case in the original data set
(Fig. 4a).

[0126] The patterns of a number of subtypes such as E2A-PBX1 or T-ALL, are reproduced 
very well whereas others are generated less well, e.g. the pattern of the subtype MLL, or are 
missed completely such as for example BCR-ABL. 

Modelling of leukaemia subtypes by intervention 

[0127] In the exemplary embodiment the learnt Bayesian network is the starting point for
using inverse modelling to find those genes which, when fixed at a
specific expression level, influence the model such that the generated artificial microarray data
set exhibits specific characteristics.

[0128] As described above, the probability P(C | E) of creation of a specific cancer subtype C is
estimated if a certain observation E is given, in this case the expression state of a specific gene,
P(C | gene state).

[0129] By contrast with Yeoh, not only the presence of a specific cancer subtype is predicted,
but also genetic mechanisms which lead to its creation.

[0130] A high probability indicates that the fixed gene is a potential cause for the subtype- 
specific expression behavior of the gene in question, which in its turn can be the underlying 
cause of a specific cancerous appearance. 

[0131] Seven reference data sets are used for the comparison, with each of these having been
obtained in conjunction with a specific ALL subtype.

[0132] Fig. 4a shows that the original microarray data set is clearly subdivided into 7 clusters
(accumulations of points) with different sample extents.

[0133] Each of these clusters represents the expression pattern of 271 genes if a specific
subtype of leukaemia is given, and has been used to measure the influence of an evidence
for the occurrence of these different ALL subtypes.

[0134] In a first step each gene is fixed at each of its expression values in turn, with each of
these conditions being used to generate a data set of 300 samples (Fig. 4b).

[0135] Subsequently all this data is compared with the 7 reference data sets, as explained 
previously. 

[0136] In Fig. 5 the probability of each subtype, under the condition that a gene is 
overexpressed, is shown on a graph for 271 genes. 

[0137] Fig. 5 shows that there is a small number of genes which are very likely to trigger a
specific ALL subtype if they are strongly active.
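
Picking out such genes from a table of probabilities can be sketched as below. The gene names, the probability values and the threshold of 0.9 are hypothetical and chosen only for illustration; the document does not state a cut-off.

```python
# Hypothetical table of P(subtype | gene overexpressed); values are made up.
PROBS = {
    "GENE_A": {"subtype_1": 0.96, "subtype_2": 0.04},
    "GENE_B": {"subtype_1": 0.08, "subtype_2": 0.92},
    "GENE_C": {"subtype_1": 0.55, "subtype_2": 0.45},
}


def likely_triggers(probs, threshold=0.9):
    """Return (gene, subtype, probability) for genes whose most probable
    subtype reaches the assumed threshold, sorted by decreasing probability."""
    hits = []
    for gene, by_subtype in probs.items():
        subtype, p = max(by_subtype.items(), key=lambda item: item[1])
        if p >= threshold:
            hits.append((gene, subtype, p))
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


if __name__ == "__main__":
    for gene, subtype, p in likely_triggers(PROBS):
        print(gene, subtype, p)
```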

[0138] To verify these results the molecular function of specific genes and their role in
biological processes, especially as regards pathogenesis, are examined in more detail below.

Biological insights 

[0139] These are obtained by examining in greater detail the genes which are very probably 
the cause of a specific subtype as well as significant structure patterns in the learnt network, i.e. 
dominant genes and their environment. 

[0140] The learnt Bayesian network (model) results from the microarray data set of different 
leukaemia subtypes and reflects transcriptional relationships between genes which occur in 
these malignant cancer cells. 

[0141] Thus genes which trigger a specific subtype are either potential oncogenes or are 
regulated by such genes. 

[0142] The first gene to be analyzed in more detail is the gene PBX1.

[0143] If it is overexpressed, the learnt Bayesian network generates, with a probability of
0.96, a data set which is characteristic of the E2A-PBX1 subtype of B-cell ALL (see

[0144] This suggests the assumption that there is a causal relationship between the
overexpression of this gene and the occurrence of the ALL subtype E2A-PBX1.

[0145] Indeed, PBX1 is known as a proto-oncogene which causes normal blood cells to mutate
into malignant ALL cancer cells.

[0146] As a result of the chromosome translocation t(1;19), PBX1 fuses with the gene E2A
and is transformed into a potent oncogene which causes the leukemia subtype E2A-PBX1.

[0147] Since the graph structure of the model (Fig. 6) can further be interpreted in a causal 
manner it provides information about the interaction between potential oncogenes and other 
genes which in its turn can be interpreted as an oncogene regulation. 

[0148] If the structure of the network (Fig. 6) is considered, PBX1 represents a dominant gene
in that it influences many other genes but is only regulated by one or a few other genes.
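
The notion of a dominant gene can be made concrete by counting incoming and outgoing connections in the graph. The sketch below assumes a directed edge list (regulator, target); the edge list itself and the cut-offs `min_targets` and `max_regulators` are hypothetical and are not taken from the learnt network of Fig. 6.

```python
from collections import defaultdict

# Hypothetical directed edges (regulator, target); not the network of Fig. 6.
EDGES = [("R", "D"), ("D", "T1"), ("D", "T2"), ("D", "T3"), ("T1", "T2")]


def degree_table(edges):
    """Map each gene to (number of regulators, number of regulated genes)."""
    in_degree = defaultdict(int)
    out_degree = defaultdict(int)
    for regulator, target in edges:
        out_degree[regulator] += 1
        in_degree[target] += 1
    genes = set(in_degree) | set(out_degree)
    return {gene: (in_degree[gene], out_degree[gene]) for gene in genes}


def dominant_genes(edges, min_targets=3, max_regulators=1):
    """Genes that influence many genes but are regulated by few (assumed cut-offs)."""
    return sorted(
        gene
        for gene, (n_regulators, n_targets) in degree_table(edges).items()
        if n_targets >= min_targets and n_regulators <= max_regulators
    )


if __name__ == "__main__":
    print(dominant_genes(EDGES))
```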

[0149] In addition, as a result of the conditional probability distribution, the model identifies 
PBX1 as a transcription activator. 

[0150] This can also be explained by known biological facts, since PBX1 activates genes 
which are normally not expressed or are expressed at a low level. 
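
How a conditional probability distribution can indicate activation may be illustrated with a simple comparison. The sketch assumes binary expression states and a hypothetical decision margin; it is one plausible reading of paragraph [0149], not the rule used in the exemplary embodiment.

```python
def regulation_type(p_high_given_reg_low, p_high_given_reg_high, margin=0.1):
    """Classify a regulator from two entries of the target's conditional table.

    p_high_given_reg_low:  P(target high | regulator low)
    p_high_given_reg_high: P(target high | regulator high)
    margin: assumed minimum difference needed to call the effect directional.
    """
    difference = p_high_given_reg_high - p_high_given_reg_low
    if difference > margin:
        return "activator"
    if difference < -margin:
        return "repressor"
    return "unclear"


if __name__ == "__main__":
    # Hypothetical numbers: the target is rarely high unless the regulator is high.
    print(regulation_type(0.1, 0.9))
```

A target that is rarely expressed when the regulator is low but frequently expressed when it is high matches the description in paragraph [0150] of genes that are normally not expressed or expressed at a low level.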

[0151] Patients with a hyperdiploidy of more than 50 chromosomes have clones of 51-68
chromosomes. Although high hyperdiploid clones are seldom identical, they tend to exhibit a
pattern of chromosome gain with additional copies of chromosomes 4, 6, 10, 14, 18 and 21.

[0152] Trisomy and polysomy 21 are non-random anomalies which are frequently observed in
ALL. Their occurrence, even if it is not specific, as well as the increased occurrence of acute
leukaemia in subjects with constitutional Trisomy 21, make it reasonable to assume that
chromosome 21 has a particular role to play in leukemogenesis.

[0153] Another disease, Down's Syndrome, is caused by Trisomy 21 and shows an
increased occurrence of leukemia such as ALL.

[0154] As a result the method described makes it possible in this case, in accordance with
the exemplary embodiment, to identify genes which to a large extent indicate the hyperdiploid
ALL subtype and which are also known to play a significant role in the occurrence of Down's
Syndrome.

[0155] The gene SOD1 is located on chromosome 21 and produces an enzyme which
converts superoxide free radicals into hydrogen peroxide. The increased expression in Trisomy
21, which is also observed in the microarray samples of patients with a hyperdiploid
karyotype, can give rise to the brain damage which is seen in Down's Syndrome.

[0156] The frequency of occurrence of hyperdiploid ALL also increases when the gene PSMD10
is overexpressed.

[0157] PSMD10 is a regulatory subunit of the 26S proteasome, which has been shown to
operate as a natural mechanism for the breakdown of proteins by regulating the protein
metabolism in eukaryotic cells.

[0158] This is of significance for cancers in humans since the cell cycle, the growth of the
tumor and its survival are determined by a great variety of intracellular proteins which are
regulated by the ubiquitin-dependent proteasome breakdown pathway, which is influenced by
PSMD10.

[0159] More recent scientific work has verified that this breakdown pathway is often the
target of a deregulation associated with cancer and can underlie such processes as
oncogenic transformation, tumor progression, evasion of the immune system and resistance
to medicaments.

Abstract of the exemplary embodiment 

[0160] The exemplary embodiment described presents a new method by which it is possible
to identify genes which are a potential cause of tumorigenesis, by analyzing the relationships
between microarray data of leukemia subtypes and a data set obtained by sampling from a
learnt Bayesian network.

[0161] This method of operation is based on the modelling of a regulatory genetic network
through a Bayesian network, with genes or their corresponding proteins being symbolized by
the nodes of the Bayesian network.

[0162] Regulation mechanisms are described by connectors between two nodes, which can
be interpreted in a causal manner.

[0163] The quality of the regulation is encoded in the conditional probability distribution of the
gene involved, given its regulators.

[0164] The understanding of the regulatory genetic network represents an important step 
along the road to characterizing the genetic mechanisms underlying complex diseases. 

[0165] In cancer research, where the identification of genes which suppress growths and
tumors plays a key role, the knowledge of new potential oncogenes and their interactions with
other molecules is an important contribution to discovering the basic principles which determine
why normal cells mutate into malignant cancer cells.

[0166] With the procedure described in accordance with the exemplary embodiment, 
especially with Bayesian Inverse Modelling, it is possible to discover genes with such an 
oncogene characteristic simply through a statistical analysis of gene expression patterns, which 
have been measured with the aid of DNA microarrays. 

[0167] The underlying theoretical probability model which has been used is a Bayesian
network, which encodes the multivariate probability distribution of a set of variables by means
of a set of conditional probability distributions.
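
Written out, this encoding is the standard factorization of a Bayesian network. The symbols are assumptions introduced here for illustration: X_i denotes the expression state of gene i, n the number of genes, and Pa(X_i) the parents of X_i in the graph, i.e. its regulators.

```latex
P(X_1, \dots, X_n) = \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{Pa}(X_i)\bigr)
```

A gene without regulators has an empty parent set, so its factor reduces to the marginal distribution P(X_i).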

[0168] The statistical dependencies are encoded in a graph structure. In the learning method 
Bayesian statistics are used to determine the network structure and the corresponding model 
parameters which best describe the probability distribution contained in the data. 
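
One common way of formalizing such Bayesian learning is to score a candidate graph structure G by its posterior probability given the data D, with the model parameters θ integrated out. The notation and the choice of this particular score are assumptions made here for illustration; the document does not state which score is used.

```latex
P(G \mid D) \propto P(G)\, P(D \mid G), \qquad
P(D \mid G) = \int P(D \mid G, \theta)\, P(\theta \mid G)\, d\theta
```

Under this formulation the learning method searches for the structure with the highest posterior and then estimates the conditional probability distributions for that structure.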

[0169] The invention has been described in detail with particular reference to preferred 
embodiments thereof and examples, but it will be understood that variations and modifications 
can be effected within the spirit and scope of the invention covered by the claims which may 
include the phrase "at least one of A, B and C" as an alternative expression that means one or 
more of A, B and C may be used, contrary to the holding in Superguide v. DIRECTV,
69 USPQ2d 1865 (Fed. Cir. 2004).